import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# import gensim.utils as g_utils
import re
from nltk.tokenize.treebank import TreebankWordDetokenizer
import gensim.utils as gutils
Loading the Emotion Dataset:
To kick things off, we begin by loading our emotion dataset.
# Loading the Emotion Dataset
df = pd.read_csv('/kaggle/input/emotion-dataset/emotion-dataset.csv')
df.head()
|   | text | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |
# Displaying Descriptive Statistics
df.describe()
|   | text | emotion |
|---|---|---|
| count | 18000 | 18000 |
| unique | 17958 | 6 |
| top | i was so stubborn and that it took you getting... | joy |
| freq | 2 | 6057 |
Text Preprocessing: Streamlining Our Data for Analysis
In the intricate world of emotion analysis, the quality of textual data plays a pivotal role in model performance. To enhance the accuracy of our predictions, we employ a robust text preprocessing function.
def preprocess(data):
    # Remove URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub('', data)
    # Remove email addresses
    data = re.sub(r'\S*@\S*\s?', '', data)
    # Collapse newlines and repeated whitespace
    data = re.sub(r'\s+', ' ', data)
    # Remove distracting single quotes
    data = re.sub(r"'", '', data)
    return data
This preprocess function acts as a text janitor, systematically cleaning our data by removing URLs, emails, unnecessary whitespace, and distracting single quotes.
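As a quick, self-contained check (re-defining the same regex steps as above), here is what the cleaner does to a sample string containing a URL, an email address, and a contraction:

```python
import re

def preprocess(data):
    # Same steps as above: URLs, emails, whitespace, single quotes
    data = re.sub(r'https?://\S+|www\.\S+', '', data)
    data = re.sub(r'\S*@\S*\s?', '', data)
    data = re.sub(r'\s+', ' ', data)
    data = re.sub(r"'", '', data)
    return data

sample = "I didn't like it, see https://example.com or mail me@site.com"
print(preprocess(sample))
```

The URL and the address disappear, and "didn't" becomes "didnt", matching the unpunctuated style of the dataset rows above.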
Text Transformation: From Sentences to Words and Back
The sent_2_words function acts as our linguistic alchemist, breaking down sentences into a stream of tokenized words, ready for analysis. On the flip side, the de_tokenize function smoothly reconstructs these processed words into coherent sentences, ensuring the integrity of our text data throughout the analysis pipeline.
def sent_2_words(sentences):
    """
    Tokenizes and preprocesses a list of sentences using Gensim's simple_preprocess.
    Parameters:
    - sentences: List of sentences to be tokenized.
    Returns:
    - A generator object yielding lists of words for each sentence.
    """
    for sentence in sentences:
        yield gutils.simple_preprocess(str(sentence), deacc=True)

def de_tokenize(text):
    """
    Detokenizes a list of words using TreebankWordDetokenizer.
    Parameters:
    - text: List of words to be detokenized.
    Returns:
    - Detokenized sentence.
    """
    return TreebankWordDetokenizer().detokenize(text)
Label Encoding for Emotions:
In this code snippet, the LabelEncoder from scikit-learn is employed to encode the categorical labels representing emotions in the DataFrame. The fit_transform method is applied to the 'emotion' column, creating a new column named 'emotion_label' with numerical representations of the corresponding emotions. This transformation is crucial for training machine learning models that require numerical input.
from sklearn.preprocessing import LabelEncoder
# Performing Label Encoding for Emotions
df['emotion_label'] = LabelEncoder().fit_transform(df['emotion'])
df.head()
|   | text | emotion | emotion_label |
|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | 4 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | 4 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 0 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | 3 |
| 4 | i am feeling grouchy | anger | 0 |
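The numeric codes follow alphabetical order of the class names (anger→0, fear→1, joy→2, love→3, sadness→4, surprise→5), which is exactly why sadness maps to 4 and anger to 0 in the table above. A minimal sketch of this behavior, including the round trip back to strings:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['sadness', 'anger', 'love', 'anger'])
print(list(le.classes_))                  # classes stored in sorted order
print(codes.tolist())                     # one integer code per input label
print(list(le.inverse_transform(codes)))  # round-trip back to strings
```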
Visualizing Emotion Distribution:
In this code snippet, the distribution of emotions within the DataFrame is visualized using a bar plot. The value_counts() method is applied to the 'emotion' column to obtain the frequency of each unique emotion, and the resulting counts are plotted using the plot function with the 'bar' kind. This visualization provides a quick overview of the distribution of emotions in the dataset.
df['emotion'].value_counts().plot(kind='bar')
<Axes: xlabel='emotion'>

Removing Duplicate Rows:
In this code snippet, rows whose 'text' value duplicates an earlier row are identified with the duplicated() method, and their indices are stored in the index variable. The drop method then removes those rows from the DataFrame in place, and reset_index reindexes the DataFrame after the removal.
df[df['text'].duplicated() == True]
|   | text | emotion | emotion_label |
|---|---|---|---|
| 5067 | i feel on the verge of tears from weariness i ... | joy | 2 |
| 6133 | i still feel a craving for sweet food | love | 3 |
| 6563 | i tend to stop breathing when i m feeling stre... | anger | 0 |
| 7623 | i was intensely conscious of how much cash i h... | sadness | 4 |
| 7685 | im still not sure why reilly feels the need to... | surprise | 5 |
| 8246 | i am not amazing or great at photography but i... | love | 3 |
| 9596 | ive also made it with both sugar measurements ... | joy | 2 |
| 9687 | i had to choose the sleek and smoother feel of... | joy | 2 |
| 9769 | i often find myself feeling assaulted by a mul... | sadness | 4 |
| 9786 | i feel im being generous with that statement | joy | 2 |
| 10117 | i feel pretty tortured because i work a job an... | fear | 1 |
| 10581 | i feel most passionate about | joy | 2 |
| 11273 | i was so stubborn and that it took you getting... | joy | 2 |
| 11354 | i write these words i feel sweet baby kicks fr... | love | 3 |
| 11525 | i feel a remembrance of the strange by justin ... | fear | 1 |
| 11823 | i have chose for myself that makes me feel ama... | joy | 2 |
| 12441 | i still feel completely accepted | love | 3 |
| 12562 | i feel so weird about it | surprise | 5 |
| 12892 | i cant escape the tears of sadness and just tr... | joy | 2 |
| 13236 | i feel like a tortured artist when i talk to her | anger | 0 |
| 13879 | i feel like i am very passionate about youtube... | love | 3 |
| 14106 | i feel kind of strange | surprise | 5 |
| 14313 | i could feel myself hit this strange foggy wall | surprise | 5 |
| 14633 | i feel pretty weird blogging about deodorant b... | fear | 1 |
| 14925 | i resorted to yesterday the post peak day of i... | fear | 1 |
| 15314 | i will feel as though i am accepted by as well... | joy | 2 |
| 15328 | i shy away from songs that talk about how i fe... | joy | 2 |
| 15571 | i bet taylor swift basks in the knowledge that... | anger | 0 |
| 15704 | i began to feel accepted by gaia on her own terms | joy | 2 |
| 15875 | i was sitting in the corner stewing in my own ... | anger | 0 |
| 16261 | i realized what i am passionate about helping ... | joy | 2 |
| 16264 | i feel so blessed and honored that we get to b... | love | 3 |
| 16352 | i could feel his breath on me and smell the sw... | joy | 2 |
| 16414 | i loved the feeling i got during an amazing sl... | joy | 2 |
| 16501 | i am feeling stressed and more than a bit anxious | anger | 0 |
| 16585 | i found myself feeling inhibited and shushing ... | sadness | 4 |
| 16916 | i feel the need to pimp this since raini my be... | joy | 2 |
| 16958 | i feel cared for and accepted | love | 3 |
| 17025 | i have not conducted a survey but it is quite ... | sadness | 4 |
| 17274 | i feel so weird and scattered with all wonders... | surprise | 5 |
| 17886 | i feel like some of you have pains and you can... | joy | 2 |
# Removing Duplicate Rows from the DataFrame
index = df[df['text'].duplicated() == True].index
df.drop(index, axis = 0, inplace = True)
df.reset_index(inplace=True, drop=True)
df[df['text'].duplicated() == True]
|   | text | emotion | emotion_label |
|---|---|---|---|
len(df)
17958
Training and Testing: The Divide for Model Mastery
As we embark on the exciting phase of model development, a critical step is to split our dataset into training and testing sets. The 'text' column serves as our feature set (X_train, X_test), while the 'emotion' column provides the corresponding labels (y_train, y_test). The first 15,000 entries are earmarked for training, and the remaining entries become our test set. This segregation ensures that the model is trained on one subset of the data and evaluated on unseen samples, gauging its generalization capabilities.
# Splitting the Dataset into Train and Test Sets
X_train = np.array(df['text'].values.tolist()[:15000])
X_test = np.array(df['text'].values.tolist()[15000:])
y_train = np.array(df['emotion'].values.tolist()[:15000])
y_test = np.array(df['emotion'].values.tolist()[15000:])
len(X_train) == len(y_train)
True
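Because the split above is positional, the class proportions in train and test depend on the row order of the file. A shuffled, stratified split via scikit-learn's train_test_split is a common alternative (a sketch on toy data, not a change to the notebook's split):

```python
from sklearn.model_selection import train_test_split

texts = [f"sample text {i}" for i in range(100)]
labels = ['joy' if i % 2 == 0 else 'sadness' for i in range(100)]

# 80/20 split, shuffled, preserving the joy/sadness ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)
print(len(X_tr), len(X_te))
```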
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
Model Training: Bridging Algorithms and Text Data
In the ever-evolving landscape of machine learning, training a model becomes an art form, especially when dealing with textual data. The train_model function creates and trains a text classification model using scikit-learn's Pipeline. It takes a model, data, and targets as input, where the model is a machine learning classifier, data represents the input features (text data), and targets are the corresponding labels. The function utilizes a TfidfVectorizer for text feature extraction and incorporates the given classifier into a pipeline.
def train_model(model, data, targets):
    """
    Trains a text classification model using scikit-learn's Pipeline.
    Parameters:
    - model: The machine learning classifier to be trained.
    - data: Input features (text data).
    - targets: Corresponding labels.
    Returns:
    - A trained Pipeline object containing TfidfVectorizer and the given model.
    """
    # Create a Pipeline object with a TfidfVectorizer and the given model
    text_clf = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', model)])
    # Fit the model on the data and targets
    text_clf.fit(data, targets)
    return text_clf
Logistic Regression Model Evaluation:
In this code snippet, a Logistic Regression model is trained using the train_model function on the training data (X_train, y_train). Subsequently, the trained model is tested on the test data (X_test), and predictions are made. The accuracy of the model is calculated using the accuracy_score function. Additionally, the F1 score is computed for each emotion category and presented in a DataFrame.
# Train the model with the training data
log_reg = train_model(LogisticRegression(solver='liblinear', random_state=0), X_train, y_train)
# Test the model with the test data
y_pred = log_reg.predict(X_test)
# Calculate the accuracy
log_reg_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', log_reg_accuracy, '\n')
# Calculate and Display F1 Scores for Each Emotion
# Note: f1_score with average=None returns scores in the order of the
# classifier's sorted classes_, so the DataFrame must be indexed accordingly
f1score = f1_score(y_test, y_pred, average=None)
pd.DataFrame(f1score, index=log_reg.classes_, columns=['F1 Scores'])
Accuracy:  0.8492224475997295
|   | F1 Scores |
|---|---|
| anger | 0.831309 |
| fear | 0.803709 |
| joy | 0.875170 |
| love | 0.688442 |
| sadness | 0.897249 |
| surprise | 0.602740 |
# Generate a detailed Classification Report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support
anger 0.92 0.76 0.83 405
fear 0.89 0.73 0.80 355
joy 0.80 0.96 0.88 1002
love 0.89 0.56 0.69 244
sadness 0.86 0.93 0.90 857
surprise 0.86 0.46 0.60 95
accuracy 0.85 2958
macro avg 0.87 0.74 0.78 2958
weighted avg 0.86 0.85 0.84 2958
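confusion_matrix is imported above but never used; it complements the classification report by showing which emotions get mistaken for which. A self-contained sketch on toy labels (rows are true labels, columns are predictions):

```python
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = ['joy', 'sadness', 'joy', 'anger', 'sadness', 'joy']
y_hat  = ['joy', 'joy',     'joy', 'anger', 'sadness', 'sadness']
labels = ['anger', 'joy', 'sadness']

# Rows index the true labels, columns the predicted labels
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
```

Off-diagonal cells pinpoint the specific confusions (here: one true joy predicted as sadness, one true sadness predicted as joy).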
Decision Tree Model Evaluation:
In this code snippet, a Decision Tree model is trained using the train_model function on the training data (X_train, y_train). Subsequently, the trained model is tested on the test data (X_test), and predictions are made. The accuracy of the model is calculated using the accuracy_score function. Additionally, the F1 score is computed for each emotion category and presented in a DataFrame. Finally, a detailed classification report is generated using the classification_report function, providing insights into precision, recall, F1-score, and support for each emotion category.
# Train the model with the training data
dec_tree = train_model(DecisionTreeClassifier(random_state=0), X_train, y_train)
# Test the model with the test data
y_pred = dec_tree.predict(X_test)
# Calculate the accuracy
DTC_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', DTC_accuracy, '\n')
# Calculate and Display F1 Scores for Each Emotion
f1score = f1_score(y_test, y_pred, average=None)
pd.DataFrame(f1score, index=dec_tree.classes_, columns=['F1 Scores'])
# Generate a detailed Classification Report
print(classification_report(y_test, y_pred))
Accuracy:  0.837052062204192
precision recall f1-score support
anger 0.86 0.86 0.86 405
fear 0.83 0.82 0.82 355
joy 0.84 0.87 0.86 1002
love 0.74 0.73 0.73 244
sadness 0.88 0.84 0.86 857
surprise 0.62 0.69 0.66 95
accuracy 0.84 2958
macro avg 0.80 0.80 0.80 2958
weighted avg 0.84 0.84 0.84 2958
Support Vector Machine Model Evaluation:
# Train the model with the training data
SVM = train_model(SVC(random_state=0), X_train, y_train)
# Test the model with the test data
y_pred = SVM.predict(X_test)
# Calculate the accuracy
SVM_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', SVM_accuracy, '\n')
# Calculate and Display F1 Scores for Each Emotion
f1score = f1_score(y_test, y_pred, average=None)
pd.DataFrame(f1score, index=SVM.classes_, columns=['F1 Scores'])
Accuracy:  0.8522650439486139
|   | F1 Scores |
|---|---|
| anger | 0.830853 |
| fear | 0.811550 |
| joy | 0.873821 |
| love | 0.682292 |
| sadness | 0.902786 |
| surprise | 0.657718 |
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support
anger 0.92 0.76 0.83 405
fear 0.88 0.75 0.81 355
joy 0.79 0.97 0.87 1002
love 0.94 0.54 0.68 244
sadness 0.88 0.93 0.90 857
surprise 0.91 0.52 0.66 95
accuracy 0.85 2958
macro avg 0.89 0.74 0.79 2958
weighted avg 0.86 0.85 0.85 2958
models = pd.DataFrame({
'Model': ['Logistic Regression', 'Decision Tree','Support Vector Machine'],
'Accuracy': [log_reg_accuracy.round(2), DTC_accuracy.round(2), SVM_accuracy.round(2)]})
models.sort_values(by='Accuracy', ascending=False).reset_index().drop(['index'], axis=1)
|   | Model | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 0.85 |
| 1 | Support Vector Machine | 0.85 |
| 2 | Decision Tree | 0.84 |
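To see the train_model pattern end to end on a toy corpus (a sketch with made-up sentences, not the notebook's data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["i feel happy and joyful", "what a joyful happy day",
         "i feel sad and hopeless", "so sad so hopeless today"]
labels = ["joy", "joy", "sadness", "sadness"]

# Same shape as train_model: TF-IDF features feeding a classifier
clf = Pipeline([('vect', TfidfVectorizer()), ('clf', LogisticRegression())])
clf.fit(texts, labels)
print(clf.predict(["so happy and joyful"]))
```

Because the vectorizer and classifier live in one Pipeline, the fitted object can predict directly from raw strings, which is also what makes it easy to hand to Lime below.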
In the intricate realm of machine learning, understanding the decisions made by models is as crucial as the predictions themselves. Today, we delve into the world of Lime—Local Interpretable Model-agnostic Explanations—a powerful tool that sheds light on the inner workings of our predictive models.
Why Lime? Lime comes into play when we deal with complex models where understanding the decision process might be challenging. It acts as a bridge between the black-box nature of certain models and our need for interpretability. Lime provides insights into why a model makes specific predictions for individual instances, offering a transparent view into the otherwise opaque model.
import seaborn as sns
import nltk
#Lime
from lime import lime_text
from lime.lime_text import LimeTextExplainer
from lime.lime_text import IndexedString,IndexedCharacters
from lime.lime_base import LimeBase
from lime.lime_text import explanation
sns.set(font_scale=1.3)
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
True
df.emotion.unique()
array(['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'], dtype=object)
dec_tree.classes_
array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='<U8')
X_test[15]
'i have this feeling whenever i write a song and if i think that the song has legs enough to be popular or for people to really respond to it i get this feeling'
np.array(df['emotion'].values.tolist()[15015])
array('joy', dtype='<U3')
Interpretation:
The model predicts 'anger' for this test instance, and Lime's explanation highlights strongly negative terms such as "heartless" and "bitch." These terms drive the model's perception of anger or intense negative emotion. Lime clarifies the model's decision process, showing the influential terms and their impact on the emotion prediction.
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
idx = 56
print("Actual Text : ", X_test[idx])
print("Prediction : ", dec_tree.predict(X_test)[idx])
print("Actual : ", y_test[idx])
exp = explainer_LR.explain_instance(X_test[idx], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  i just want the best for that boy maybe i can really stop feeling like im a heartless bitch
Prediction : anger
Actual : anger
Interpretation:
The model predicts 'joy' for the given custom text, and Lime's explanation reinforces the positive sentiment embedded in the words "not sad" and "happy." The negation of "sad" contributes significantly to the positive emotion prediction. Lime provides a transparent view into the model's decision process, highlighting the influential terms and their impact on the predicted emotion.
custom_x = np.array(['i am not sad, but happy, yes not very sad'])
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
print("Actual Text : ", custom_x)
print("Prediction : ", dec_tree.predict(custom_x)[0])
print("Actual : ", 'joy')
exp = explainer_LR.explain_instance(custom_x[0], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  ['i am not sad, but happy, yes not very sad']
Prediction : joy
Actual : joy
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
idx = 15
print("Actual Text : ", X_test[idx])
print("Prediction : ", dec_tree.predict(X_test)[idx])
print("Actual : ", y_test[idx])
exp = explainer_LR.explain_instance(X_test[idx], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  i have this feeling whenever i write a song and if i think that the song has legs enough to be popular or for people to really respond to it i get this feeling
Prediction : joy
Actual : joy
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
idx = 157
print("Actual Text : ", X_test[idx])
print("Prediction : ", dec_tree.predict(X_test)[idx])
print("Actual : ", y_test[idx])
exp = explainer_LR.explain_instance(X_test[idx], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  i feel so vain when i look at myself and notice how much i like my nose or how nice my face structure is
Prediction : sadness
Actual : sadness
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, LSTM, Embedding, Bidirectional
temp = []
#Splitting pd.Series to list
data_to_list = df['text'].values.tolist()
for i in range(len(data_to_list)):
    temp.append(preprocess(data_to_list[i]))
data_words = list(sent_2_words(temp))
data = []
for i in range(len(data_words)):
    data.append(de_tokenize(data_words[i]))
print(data[:5])
['didnt feel humiliated', 'can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing minute to post feel greedy wrong', 'am ever feeling nostalgic about the fireplace will know that it is still on the property', 'am feeling grouchy']
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras import regularizers
max_words = 5000
max_len = max([len(t) for t in data])
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
msgs = pad_sequences(sequences, maxlen=max_len)
print(msgs)
[[   0    0    0 ...  133    1  637]
[ 0 0 0 ... 2 19 1333]
[ 0 0 0 ... 1 457 407]
...
[ 0 0 0 ... 5 7 3267]
[ 0 0 0 ... 46 8 2490]
[ 0 0 0 ... 297 3 297]]
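A tiny self-contained example of what fit_on_texts / texts_to_sequences / pad_sequences produce: word indices are assigned by frequency, and pad_sequences left-pads with zeros by default.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ["i feel happy", "i feel very very sad today"]
tok = Tokenizer(num_words=50)
tok.fit_on_texts(docs)
seqs = tok.texts_to_sequences(docs)
print(seqs)                       # each word mapped to an integer index
padded = pad_sequences(seqs, maxlen=6)
print(padded)                     # zero-padded on the left to length 6
```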
X_train = np.array(msgs[:15000])
X_test = np.array(msgs[15000:])
y_train = np.array(df['emotion_label'].values.tolist()[:15000])
y_test = np.array(df['emotion_label'].values.tolist()[15000:])
df.head()
|   | text | emotion | emotion_label |
|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | 4 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | 4 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 0 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | 3 |
| 4 | i am feeling grouchy | anger | 0 |
vocabSize = len(tokenizer.index_word) + 1
!wget http://nlp.stanford.edu/data/glove.6B.zip
--2024-01-02 15:45:36--  http://nlp.stanford.edu/data/glove.6B.zip
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’
2024-01-02 15:48:15 (5.17 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]
!unzip glove*.zip
Archive:  glove.6B.zip
inflating: glove.6B.50d.txt
inflating: glove.6B.100d.txt
inflating: glove.6B.200d.txt
inflating: glove.6B.300d.txt
Loading GloVe Embeddings and Creating Embedding Matrix
In this code snippet, GloVe word embeddings are loaded from a specified file (path_to_glove_file). The embeddings are parsed, and a dictionary (embeddings_index) is created, mapping words to their corresponding embedding vectors. Subsequently, an embedding matrix is constructed using this pre-trained GloVe data. The matrix is shaped to match the vocabulary size (num_tokens) and embedding dimension (embedding_dim). Words from the tokenizer's vocabulary are assigned their respective embedding vectors if available; otherwise, the embedding is set to zero.
# Read GloVE embeddings
path_to_glove_file = '/kaggle/working/glove.6B.200d.txt'
num_tokens = vocabSize
embedding_dim = 200 #latent factors or features
hits = 0
misses = 0
embeddings_index = {}
# Load GloVe word embeddings from the specified file
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
print("Found %s word vectors." % len(embeddings_index))
# Initialize an embedding matrix for our neural network
embedding_matrix = np.zeros((num_tokens, embedding_dim))
# Assign pre-trained word vectors to our vocabulary
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words found in the embedding index are assigned their respective vectors
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index are left as all-zeros
        # This includes the representation for "padding" and "OOV" (Out of Vocabulary)
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
Found 400000 word vectors.
Converted 30056 words (2158 misses)
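Each line of a GloVe file is a token followed by its vector components, which is what the split/fromstring pair above parses. This can be checked on a single hand-written line (the values here are made up, not real GloVe data):

```python
import numpy as np

line = "happy 0.1 -0.2 0.3 0.4"           # format of one GloVe line
word, coefs = line.split(maxsplit=1)      # token, then the rest of the line
vec = np.fromstring(coefs, "f", sep=" ")  # parse components as float32
print(word, vec)
```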
X_train.shape
(15000, 294)
Neural Network Architecture: Bidirectional LSTM
In this code snippet, a neural network model is constructed using a Bidirectional Long Short-Term Memory (Bi-LSTM) architecture for emotion analysis. The model is compiled using the sparse categorical cross-entropy loss function and the RMSprop optimizer. The architecture comprises an embedding layer, three Bidirectional LSTM layers with varying dropout rates, and a final dense layer with a softmax activation function for multi-class classification.
# Define the sequential model
model = Sequential()
# Add an embedding layer with pre-trained weights and fixed trainable status
model.add(Embedding(vocabSize, 200, input_length=X_train.shape[1], weights=[embedding_matrix], trainable=False))
# Stack three Bidirectional LSTM layers with varying dropout rates
model.add(Bidirectional(LSTM(256, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))
model.add(Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3, return_sequences=True)))
model.add(Bidirectional(LSTM(128, dropout=0.5, recurrent_dropout=0.5)))
# Add a dense layer with softmax activation for multi-class classification
model.add(Dense(6, activation='softmax'))
# Compile the model with sparse categorical cross-entropy loss and RMSprop optimizer
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# Display a summary of the model architecture
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                    Output Shape              Param #
=================================================================
 embedding (Embedding)           (None, 294, 200)          3221600
 bidirectional (Bidirectional)   (None, 294, 512)          935936
 bidirectional_1 (Bidirectional) (None, 294, 256)          656384
 bidirectional_2 (Bidirectional) (None, 256)               394240
 dense (Dense)                   (None, 6)                 1542
=================================================================
Total params: 5,209,702
Trainable params: 1,988,102
Non-trainable params: 3,221,600
_________________________________________________________________
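The summary's parameter counts can be sanity-checked by hand: a Bidirectional LSTM has 2 directions × 4 gates × units × (input_dim + units + 1) weights, and each subsequent layer sees the concatenated forward/backward outputs of the layer below:

```python
def bilstm_params(units, input_dim):
    # 2 directions, 4 gates (input, forget, cell, output);
    # each gate: units x (input_dim + units) weights plus units biases
    return 2 * 4 * units * (input_dim + units + 1)

print(bilstm_params(256, 200))      # first Bi-LSTM over 200-d embeddings
print(bilstm_params(128, 2 * 256))  # second layer sees 512-d sequences
print(bilstm_params(128, 2 * 128))  # third layer sees 256-d sequences
```

These reproduce the 935,936 / 656,384 / 394,240 figures in the summary above.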
from keras.utils.vis_utils import plot_model
plot_model(model, show_shapes=True)
#to stop the training when the loss starts to increase
callback = EarlyStopping(
monitor="val_loss",
patience=4,
restore_best_weights=True,
)
Training Model and Achieving 90% Accuracy:
This code snippet encapsulates the training process of the Bidirectional LSTM model. The achieved validation accuracy of roughly 90% indicates the model's effectiveness in learning and generalizing from the emotional text data.
# Fit model
history = model.fit(X_train,
y_train,
validation_data=(X_test, y_test),
verbose=1,
batch_size=256,
epochs=15,
callbacks=[callback]
)
Epoch 1/5
59/59 [==============================] - 349s 6s/step - loss: 0.3502 - accuracy: 0.8705 - val_loss: 0.3633 - val_accuracy: 0.8611
Epoch 2/5
59/59 [==============================] - 349s 6s/step - loss: 0.3106 - accuracy: 0.8816 - val_loss: 0.2811 - val_accuracy: 0.8884
Epoch 3/5
59/59 [==============================] - 349s 6s/step - loss: 0.2844 - accuracy: 0.8884 - val_loss: 0.2763 - val_accuracy: 0.8925
Epoch 4/5
59/59 [==============================] - 349s 6s/step - loss: 0.2492 - accuracy: 0.9024 - val_loss: 0.2546 - val_accuracy: 0.9013
Epoch 5/5
59/59 [==============================] - 348s 6s/step - loss: 0.2324 - accuracy: 0.9101 - val_loss: 0.2418 - val_accuracy: 0.9009
In our expedition through the vast realm of machine learning, our venture into the world of emotion analysis has been a captivating odyssey filled with discovery, experimentation, and innovation. From the foundational steps of preparing our data to the intricate choreography between interpretable models and pre-trained embeddings, each stage has played a vital role in crafting a resilient and insightful emotion analysis system.
The harmony between conventional machine learning models and more advanced architectures, such as Bidirectional LSTMs, has uncovered the potency of contextual comprehension. This revelation allows us to unravel the complexities embedded within textual data, shedding light on the nuanced expressions of emotion. Our exploration into tools like Lime has acted as a guiding lantern, casting clarity on the decision-making mechanisms of these models, fostering transparency and trust in our analytical journey.
The significance of pre-trained word embeddings, especially the GloVe embeddings, cannot be overstated. They have proven to be the lifeblood of our models, infusing words with semantic richness. This infusion enables our neural network to not just understand but truly feel the subtle shades of emotion encoded in text, achieving an impressive accuracy of 90%.